111 research outputs found

Integrating E-Commerce and Data Mining: Architecture and Challenges

Author: Ansari Suhail
Kohavi Ron
Mason Llew
Zheng Zijian
Publication date: 01/01/2000

We show that the e-commerce domain can provide all the right ingredients for successful data mining and claim that it is a killer domain for data mining. We describe an integrated architecture, based on our experience at Blue Martini Software, for supporting this integration. The architecture can dramatically reduce the pre-processing, cleaning, and data understanding effort often documented to take 80% of the time in knowledge discovery projects. We emphasize the need for data collection at the application server layer (not the web server) in order to support logging of data and metadata that is essential to the discovery process. We describe the data transformation bridges required from the transaction processing systems and customer event streams (e.g., clickstreams) to the data warehouse. We detail the mining workbench, which needs to provide multiple views of the data through reporting, data mining algorithms, visualization, and OLAP. We conclude with a set of challenges. Comment: KDD workshop: WebKDD 200

arXiv.org e-Print Archive

CiteSeerX

Statistical Challenges in Online Controlled Experiments: A Review of A/B Testing Methodology

Author: Deng Alex
Kohavi Ron
Larsen Nicholas
Sengupta Srijan
Stallrich Jonathan
Stevens Nathaniel
Publication date: 02/08/2023

The rise of internet-based services and products in the late 1990s brought about an unprecedented opportunity for online businesses to engage in large-scale data-driven decision making. Over the past two decades, organizations such as Airbnb, Alibaba, Amazon, Baidu, Booking, Alphabet's Google, LinkedIn, Lyft, Meta's Facebook, Microsoft, Netflix, Twitter, Uber, and Yandex have invested tremendous resources in online controlled experiments (OCEs) to assess the impact of innovation on their customers and businesses. Running OCEs at scale has presented a host of challenges requiring solutions from many domains. In this paper we review challenges that require new statistical methodologies to address them. In particular, we discuss the practice and culture of online experimentation, as well as its statistics literature, placing the current methodologies within their relevant statistical lineages and providing illustrative examples of OCE applications. Our goal is to raise academic statisticians' awareness of these new research opportunities to increase collaboration between academia and the online industry.

arXiv.org e-Print Archive

The Optimisation of Bayesian Classifier in Predictive Spatial Modelling for Secondary Mineral Deposits

Author: Adamu
Gregory
Jack
Jeffrey
Nir
Nir Friedman
Pierre
Pierre Legendre
Ron Kohavi
Publication venue: 'Elsevier BV'
Publication date: 01/01/2015

This paper discusses the general concept of the Bayesian Network classifier and the optimisation of a predictive spatial model using Naive Bayes (NB) on secondary mineral deposit data. Different NB modelling approaches were applied to mineral distribution data to predict the occurrence of a particular mineral deposit in a given area: predictive attribute sub-selection, normalised attribute selection, NB with dependent attributes, and strict adherence to the NB assumption of attribute independence. The performance of the model was determined by selecting the model with the best predictive accuracy. The NB classifier that violates the assumption of attribute independence was compared with the other forms of NB. The aim is to improve the general performance of the model through the best selection of predictive attribute data. The paper elaborates on the workings of a Bayesian Network learning model, the concept of NB, and its application to predicting mineral deposit potential. The result of the optimised NB model, based on predictive accuracies and the Receiver Operating Characteristic (ROC) value, is also determined.

Elsevier - Publisher Connector

Crossref

White Rose Research Online

Is the Stack Distance Between Test Case and Method Correlated With Test Effectiveness?

Author: Acree Allen Troy
Chawla Nitesh V
Jefferson Offutt A
Ji Changbin
Kohavi Ron
Marko Ivanković Goran Petrović
Niedermayr Rainer
Schuler David
Strug Joanna
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 13/03/2019

Mutation testing is a means to assess the effectiveness of a test suite and its outcome is considered more meaningful than code coverage metrics. However, despite several optimizations, mutation testing requires a significant computational effort and has not been widely adopted in industry. Therefore, we study in this paper whether test effectiveness can be approximated using a more light-weight approach. We hypothesize that a test case is more likely to detect faults in methods that are close to the test case on the call stack than in methods that the test case accesses indirectly through many other methods. Based on this hypothesis, we propose the minimal stack distance between test case and method as a new test measure, which expresses how close any test case comes to a given method, and study its correlation with test effectiveness. We conducted an empirical study with 21 open-source projects, which comprise in total 1.8 million LOC, and show that a correlation exists between stack distance and test effectiveness. The correlation reaches a strength up to 0.58. We further show that a classifier using the minimal stack distance along with additional easily computable measures can predict the mutation testing result of a method with 92.9% precision and 93.4% recall. Hence, such a classifier can be taken into consideration as a light-weight alternative to mutation testing or as a preceding, less costly step to that. Comment: EASE 201
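The minimal stack distance described in this abstract can be illustrated with a small sketch. It assumes the calls made during test runs have already been recorded as a caller-to-callees mapping; the abstract does not say how the authors collect this data, so the `call_graph` structure and the function name `minimal_stack_distances` are hypothetical. A breadth-first search started from all test cases at once gives, for each method, the smallest number of calls separating it from any test case.

```python
from collections import deque


def minimal_stack_distances(call_graph, test_cases):
    """Return {method: minimal number of calls from any test case}.

    call_graph is a hypothetical mapping from a caller name to the
    names it calls; it stands in for whatever call data is available.
    """
    # Every test case starts at distance 0 (multi-source BFS).
    distances = {test: 0 for test in test_cases}
    queue = deque(test_cases)
    while queue:
        caller = queue.popleft()
        for callee in call_graph.get(caller, ()):
            if callee not in distances:
                # First visit in BFS order is the shortest path.
                distances[callee] = distances[caller] + 1
                queue.append(callee)
    # Report methods only, not the test cases themselves.
    return {name: d for name, d in distances.items() if name not in test_cases}


example_graph = {
    "testA": ["m1"],
    "testB": ["m3"],
    "m1": ["m2"],
    "m2": ["m3"],
}
example_distances = minimal_stack_distances(example_graph, ["testA", "testB"])
```

In this toy graph `m3` is reached directly by `testB`, so its distance is 1 even though `testA` only reaches it through two other methods.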

arXiv.org e-Print Archive

Crossref

Emerging trends in business analytics

Author: Becker B.
Berry M.
Evangelos Simoudis
Kimball R.
Neal J. Rothleder
Ron Kohavi
Thearling K.
Publication venue: 'Association for Computing Machinery (ACM)'

Crossref

Private Summation in the Multi-Message Shuffle Model

Author: Balle Borja
Cheu Albert
Ghazi Badih
Hubert Chan T.-H.
Impagliazzo Russell
Ishai Yuval
Kohavi Ron
Shi Elaine
Wang Yu-Xiang
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 11/11/2020

The shuffle model of differential privacy (Erlingsson et al. SODA 2019; Cheu et al. EUROCRYPT 2019) and its close relative encode-shuffle-analyze (Bittau et al. SOSP 2017) provide a fertile middle ground between the well-known local and central models. Similarly to the local model, the shuffle model assumes an untrusted data collector who receives privatized messages from users, but in this case a secure shuffler is used to transmit messages from users to the collector in a way that hides which messages came from which user. An interesting feature of the shuffle model is that increasing the amount of messages sent by each user can lead to protocols with accuracies comparable to the ones achievable in the central model. In particular, for the problem of privately computing the sum of $n$ bounded real values held by $n$ different users, Cheu et al. showed that $O(\sqrt{n})$ messages per user suffice to achieve $O(1)$ error (the optimal rate in the central model), while Balle et al. (CRYPTO 2019) recently showed that a single message per user leads to $\Theta(n^{1/3})$ MSE (mean squared error), a rate strictly in-between what is achievable in the local and central models. This paper introduces two new protocols for summation in the shuffle model with improved accuracy and communication trade-offs. Our first contribution is a recursive construction based on the protocol from Balle et al. mentioned above, providing $\mathrm{poly}(\log \log n)$ error with $O(\log \log n)$ messages per user. The second contribution is a protocol with $O(1)$ error and $O(1)$ messages per user based on a novel analysis of the reduction from secure summation to shuffling introduced by Ishai et al. (FOCS 2006) (the original reduction required $O(\log n)$ messages per user). Comment: Published at CCS'2
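The reduction from secure summation to shuffling mentioned at the end of this abstract rests on additive secret sharing, which can be sketched as follows. This is only an illustration of the share-splitting idea under stated assumptions: inputs are non-negative integers whose true sum is below a chosen `modulus`, no differential-privacy noise is added, and Python's `random` module stands in for the cryptographic randomness a real protocol would need. The function names and parameters are hypothetical and are not taken from the paper.

```python
import random


def split_into_shares(value, num_shares, modulus, rng):
    """Split value into num_shares additive shares modulo modulus."""
    # All but the last share are uniform; the last one fixes the sum.
    shares = [rng.randrange(modulus) for _ in range(num_shares - 1)]
    last = (value - sum(shares)) % modulus
    return shares + [last]


def shuffle_and_sum(user_values, num_shares, modulus, seed=0):
    """Simulate users, shuffler, and analyzer in one process."""
    rng = random.Random(seed)  # illustration only, not cryptographically secure
    messages = []
    for value in user_values:
        messages.extend(split_into_shares(value, num_shares, modulus, rng))
    # The shuffler hides which message came from which user.
    rng.shuffle(messages)
    # The analyzer only needs the sum, which is order-independent.
    return sum(messages) % modulus


total = shuffle_and_sum([3, 5, 7], num_shares=4, modulus=1000)
```

Each user sends `num_shares` messages, so this parameter corresponds to the messages-per-user figure that the abstract compares across protocols.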

arXiv.org e-Print Archive

Crossref

On the discriminative power of Hyper-parameters in Cross-Validation and how to choose them

Author: Anelli Vito Walter
Bellogín Alejandro
Bennett James
Bergstra James
Cremonesi Paolo
de Souza Bruno Feres
Hutter Frank
Kohavi Ron
Rendle Steffen
Smith Michael R.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 05/09/2019

Hyper-parameter tuning is a crucial task to make a model perform at its best. However, despite the well-established methodologies, some aspects of the tuning remain unexplored. For example, tuning may affect not just accuracy but also novelty, and its effect may depend on the dataset adopted. Moreover, sometimes it could be sufficient to concentrate on a single parameter only (or a few of them) instead of the overall set. In this paper we report on our investigation of hyper-parameter tuning by performing an extensive 10-fold cross-validation on MovieLens and Amazon Movies for three well-known baselines: User-kNN, Item-kNN, BPR-MF. We adopted a grid search strategy considering approximately 15 values for each parameter, and we then evaluated each combination of parameters in terms of accuracy and novelty. We investigated the discriminative power of nDCG, Precision, Recall, MRR, EFD, and EPC, and, finally, we analyzed the role of parameters on model evaluation for cross-validation. Comment: 5 pages RecSys 201

arXiv.org e-Print Archive

Crossref

Controlled experiments on the web: survey and practical guide

Author: C Hopkins
Dan Sommerfield
DD Boos
H Manning
M Burns
OL Davies
Randal M. Henne
RL Plackett
Roger Longbotham
Ron Kohavi
Publication venue: 'Springer Science and Business Media LLC'

Crossref

Deep Weighted Averaging Classifiers

Author: Aggarwal Charu C.
Goodfellow Ian J.
Guo Chuan
Hendrycks Dan
Kim Been
Kohavi Ron
Lee Kimin
Lei Tao
Liang Shiyu
Maas Andrew L.
Mueller Jonas
Ribeiro Marco Túlio
Scott
Vovk Vladimir
Wang Fulton
Watson Geoffrey S.
Weinberger Kilian Q.
Xing Eric P.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 18/11/2018

Recent advances in deep learning have achieved impressive gains in classification accuracy on a variety of types of data, including images and text. Despite these gains, however, concerns have been raised about the calibration, robustness, and interpretability of these models. In this paper we propose a simple way to modify any conventional deep architecture to automatically provide more transparent explanations for classification decisions, as well as an intuitive notion of the credibility of each prediction. Specifically, we draw on ideas from nonparametric kernel regression, and propose to predict labels based on a weighted sum of training instances, where the weights are determined by distance in a learned instance-embedding space. Working within the framework of conformal methods, we propose a new measure of nonconformity suggested by our model, and experimentally validate the accompanying theoretical expectations, demonstrating improved transparency, controlled error rates, and robustness to out-of-domain data, without compromising on accuracy or calibration. Comment: 13 pages, 8 figures, 5 tables, added DOI and updated to meet ACM formatting requirements, In Proceedings of FAT* (2019
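The prediction rule in this abstract, a weighted sum over training instances with weights set by distance in an embedding space, can be sketched in a few lines. The sketch assumes the embeddings are already computed and fixed (the abstract describes them as learned by a deep network, which is not reproduced here) and assumes a Gaussian kernel with a hand-picked `bandwidth`. The function name and parameters are hypothetical.

```python
import math


def predict_proba(query, train_embeddings, train_labels, num_classes, bandwidth=1.0):
    """Class probabilities from kernel-weighted votes of training instances."""
    scores = [0.0] * num_classes
    for embedding, label in zip(train_embeddings, train_labels):
        sq_dist = sum((q - e) ** 2 for q, e in zip(query, embedding))
        # Assumed Gaussian kernel: closer training points get larger weights.
        scores[label] += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
    total = sum(scores)
    if total == 0.0:
        # Every weight underflowed to zero: fall back to a uniform guess.
        return [1.0 / num_classes] * num_classes
    return [s / total for s in scores]


train_embeddings = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
train_labels = [0, 0, 1, 1]
probs = predict_proba((0.2, 0.1), train_embeddings, train_labels, num_classes=2)
```

Because each training point contributes an explicit weight, the points with the largest weights can be shown as the explanation for a prediction, which is the kind of transparency the abstract refers to.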

arXiv.org e-Print Archive

Crossref

Disjunctions of Conjunctions, Cognitive Simplicity, and Consideration Sets

Author: Breiman Leo
Cormen Thomas H.
Efron Bradley
Gigerenzer Gerd, Peter M. Todd, and the ABC Research Group
Hastie Trevor
Kohavi Ron
Langley Pat
Vapnik Vladimir
Publication venue: 'American Marketing Association (AMA)'

Crossref